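Neural ordinary differential equations (Neural ODEs) are the continuous analog of residual neural networks (ResNets). We investigate whether the discrete dynamics defined by a ResNet are close to the continuous dynamics of a Neural ODE. We first quantify the distance between the ResNet's hidden-state trajectory and the solution of its corresponding Neural ODE. Our bound is tight and, on the negative side, does not go to 0 with depth N if the residual functions are not smooth with depth. On the positive side, we show that this smoothness is preserved by gradient descent for a ResNet with linear residual functions and a small enough initial loss. This ensures an implicit regularization towards a limit Neural ODE at rate 1/N, uniformly in depth and optimization time. As a byproduct of our analysis, we consider the use of a memory-free discrete adjoint method to train a ResNet by recovering the activations on the fly through a backward pass of the network, and we show that this method theoretically succeeds at large depth if the residual functions are Lipschitz with respect to the input. We then show that Heun's method, a second-order ODE integration scheme, allows for better gradient estimation with the adjoint method when the residual functions are smooth with depth. We experimentally validate that our adjoint method succeeds at large depth and that Heun's method needs fewer layers to succeed. Finally, we successfully use the adjoint method to fine-tune very deep ResNets without memory consumption in the residual layers.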
translated by 谷歌翻译
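The link between ResNets, explicit Euler steps, and Heun's method described in the abstract above can be illustrated with a minimal sketch. The residual function `f`, the initial state, and the layer counts below are hypothetical choices made for illustration, not the authors' setup; the fine-grid Heun run is only a numerical proxy for the exact ODE solution.

```python
import numpy as np

def f(x, t):
    # Hypothetical residual function, smooth in the state x and in the depth variable t.
    return np.sin(t) * np.tanh(x)

def resnet_euler(x0, n_layers):
    # ResNet forward pass x_{k+1} = x_k + (1/N) f(x_k, k/N): explicit Euler on [0, 1].
    x, h = x0, 1.0 / n_layers
    for k in range(n_layers):
        x = x + h * f(x, k * h)
    return x

def resnet_heun(x0, n_layers):
    # Heun's method: average the slope at the current point and at the Euler prediction.
    x, h = x0, 1.0 / n_layers
    for k in range(n_layers):
        t = k * h
        k1 = f(x, t)
        k2 = f(x + h * k1, t + h)
        x = x + 0.5 * h * (k1 + k2)
    return x

x0 = np.array([0.5, -1.0, 2.0])
reference = resnet_heun(x0, 20000)  # fine-grid proxy for the ODE solution at t = 1

errors = {}
for n in (10, 20, 40):
    err_euler = np.max(np.abs(resnet_euler(x0, n) - reference))
    err_heun = np.max(np.abs(resnet_heun(x0, n) - reference))
    errors[n] = (err_euler, err_heun)
    print(n, err_euler, err_heun)
```

With a residual function that is smooth in depth, the Euler error should shrink roughly like 1/N and the Heun error roughly like 1/N², which is consistent with the claim that Heun's method needs fewer layers.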
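Unbalanced optimal transport (UOT) extends optimal transport (OT) to take mass variations into account when comparing distributions. This is crucial to making OT successful in ML applications, as it makes OT robust to data normalization and outliers. The baseline algorithm is Sinkhorn, but its convergence can be significantly slower for UOT than for OT. In this work, we identify the cause of this deficiency, namely the lack of a global normalization of the iterates, which equivalently corresponds to a translation of the dual OT potentials. Our first contribution leverages this idea to develop a provably accelerated Sinkhorn algorithm for UOT (coined "translation invariant Sinkhorn"), bridging the computational gap with OT. Our second contribution focuses on 1-D UOT and proposes a Frank-Wolfe solver applied to this translation-invariant formulation. The linear oracle of each step amounts to solving a 1-D OT problem, resulting in a linear time complexity per iteration. Our last contribution extends this method to the computation of UOT barycenters of 1-D measures. Numerical simulations showcase the convergence speed improvement brought by these three approaches.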
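For context, the baseline Sinkhorn iteration for entropic UOT with KL marginal penalties can be sketched as follows. This is the standard scaling iteration, not the translation-invariant variant proposed in the abstract; the regularization `eps`, the marginal penalty `rho`, and the toy 1-D measures are assumptions chosen for illustration.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.1, rho=1.0, n_iter=1000):
    # Baseline scaling iterations for entropic UOT with KL marginal penalties.
    K = np.exp(-C / eps)          # Gibbs kernel
    power = rho / (rho + eps)     # exponent < 1 softens the marginal constraints
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    P = u[:, None] * K * v[None, :]   # transport plan diag(u) K diag(v)
    return P, u, v

# Toy 1-D measures with different total masses (1 and 2).
x = np.linspace(0.0, 1.0, 5)
y = np.linspace(0.0, 1.0, 6)
C = (x[:, None] - y[None, :]) ** 2
a = np.full(5, 1.0 / 5)
b = np.full(6, 2.0 / 6)

eps, rho = 0.1, 1.0
P, u, v = unbalanced_sinkhorn(a, b, C, eps=eps, rho=rho)
print("transported mass:", P.sum())
print("row marginals:", P.sum(axis=1))
```

Because the exponent `rho / (rho + eps)` is below 1, the plan's marginals are only pulled towards `a` and `b` rather than forced to match them, which is how the mass mismatch is absorbed.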
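Overparameterization is a key factor, in the absence of convexity, in explaining the global convergence of gradient descent (GD) for neural networks. Besides the well-studied lazy regime, infinite-width (mean-field) analysis has been developed for shallow networks using convex optimization techniques. To bridge the gap between the lazy and mean-field regimes, we study residual networks (ResNets) in which the residual block has a linear parameterization while still being nonlinear. Such ResNets admit both infinite-depth and infinite-width limits, encoding the residual blocks in a reproducing kernel Hilbert space (RKHS). In this limit, we prove a local Polyak-Lojasiewicz inequality. Thus, every critical point is a global minimizer and a local convergence result holds for GD, which recovers the lazy regime. In contrast with other mean-field studies, it applies to both parametric and non-parametric cases under an expressivity condition on the residuals. Our analysis leads to a practical and quantified recipe: starting from a universal RKHS, random Fourier features are applied to obtain a finite-dimensional parameterization that satisfies our expressivity condition with high probability.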
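The random Fourier feature step mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes a Gaussian kernel with bandwidth `sigma` as the RKHS; the kernel choice, the feature count, and the data are hypothetical and not taken from the paper.

```python
import numpy as np

def random_fourier_features(X, n_features, sigma, rng):
    # Random features z(x) with z(x) . z(y) ~ exp(-||x - y||^2 / (2 sigma^2)).
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))  # frequencies ~ N(0, 1/sigma^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)       # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
sigma = 1.5

Z = random_fourier_features(X, 5000, sigma, rng)

# Exact Gaussian kernel matrix for comparison.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))

max_err = np.max(np.abs(Z @ Z.T - K_exact))
print("max kernel approximation error:", max_err)
```

A model that is linear in the features `Z` is then a finite-dimensional, linearly parameterized stand-in for a function in the RKHS, with an approximation error that decreases as the number of features grows.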
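An increasing number of machine learning problems, such as robust or adversarial variants of existing algorithms, require minimizing a loss function that is itself defined as a maximum. Carrying out a loop of stochastic gradient ascent (SGA) steps on the (inner) maximization problem, followed by an SGD step on the (outer) minimization, is known as Epoch Stochastic Gradient Descent Ascent (ESGDA). While successful in practice, the theoretical analysis of ESGDA remains challenging, with no clear guidance on the choice of the inner loop size nor on the interplay between inner and outer step sizes. We propose RSGDA (Randomized SGDA), a variant of ESGDA with a stochastic loop size and a simpler theoretical analysis. RSGDA comes with the first (among SGDA algorithms) almost sure convergence rates when used in nonconvex min/strongly-concave max settings. RSGDA can be parameterized using optimal loop sizes that guarantee the best convergence rates known to hold for SGDA. We test RSGDA on toy and larger-scale problems, using distributionally robust optimization and single-cell data matching with optimal transport as a testbed.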
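One way to obtain a stochastic loop size is to flip a biased coin at every iteration, so that the number of ascent steps between two descent steps is geometrically distributed. The sketch below follows that reading of the abstract; it is an assumption, not the authors' exact algorithm or step-size schedule. The toy objective f(x, y) = xy - y²/2 is strongly concave in y but, unlike the setting of the paper, not nonconvex in x; the probability `p`, the step sizes, and the noise level are hypothetical.

```python
import numpy as np

def rsgda(grad_x, grad_y, x0, y0, p=0.2, lr_x=0.05, lr_y=0.2, n_steps=5000, seed=0):
    # Randomized descent-ascent: descent on x with probability p, ascent on y otherwise.
    rng = np.random.default_rng(seed)
    x, y = float(x0), float(y0)
    for _ in range(n_steps):
        if rng.random() < p:
            x -= lr_x * grad_x(x, y, rng)   # outer (minimization) step
        else:
            y += lr_y * grad_y(x, y, rng)   # inner (maximization) step
    return x, y

# Noisy gradients of f(x, y) = x * y - y**2 / 2, whose saddle point is (0, 0).
noise = 0.05
grad_x = lambda x, y, rng: y + noise * rng.normal()
grad_y = lambda x, y, rng: (x - y) + noise * rng.normal()

x_final, y_final = rsgda(grad_x, grad_y, x0=2.0, y0=0.0)
print(x_final, y_final)
```

With `p = 0.2`, each descent step is preceded on average by four ascent steps, which plays the role of the inner loop size in ESGDA.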
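Defining meaningful distances between samples in a dataset is a fundamental problem in machine learning. Optimal transport (OT) lifts a distance between features (the "ground metric") to a geometrically meaningful distance between samples. However, there is usually no straightforward choice of ground metric. Supervised ground metric learning approaches exist but require labeled data. In the absence of labels, only ad hoc ground metrics remain. Unsupervised ground metric learning is thus a fundamental problem for enabling data-driven applications of OT. In this paper, we propose for the first time a canonical answer by simultaneously computing an OT distance between the samples and between the features of a dataset. These distance matrices emerge naturally as positive singular vectors of the function mapping ground metrics to OT distances. We provide criteria that ensure the existence and uniqueness of these singular vectors. We then introduce scalable computational methods to approximate them in high-dimensional settings, using stochastic approximation and entropic regularization. Finally, we showcase Wasserstein singular vectors on a single-cell RNA-sequencing dataset.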
Non-linear state-space models, also known as general hidden Markov models, are ubiquitous in statistical machine learning, being the most classical generative models for serial data and sequences in general. The particle-based, rapid incremental smoother (PaRIS) is a sequential Monte Carlo (SMC) technique allowing for efficient online approximation of expectations of additive functionals under the smoothing distribution in these models. Such expectations appear naturally in several learning contexts, such as maximum likelihood estimation (MLE) and Markov score climbing (MSC). PaRIS has linear computational complexity and limited memory requirements, and comes with non-asymptotic bounds, convergence results and stability guarantees. Still, being based on self-normalised importance sampling, the PaRIS estimator is biased. Our first contribution is to design a novel additive smoothing algorithm, the Parisian particle Gibbs (PPG) sampler, which can be viewed as a PaRIS algorithm driven by conditional SMC moves, resulting in bias-reduced estimates of the targeted quantities. We substantiate the PPG algorithm with theoretical results, including new bounds on bias and variance as well as deviation inequalities. Our second contribution is to apply PPG in a learning framework, covering MLE and MSC as special examples. In this context, we establish, under standard assumptions, non-asymptotic bounds highlighting the value of bias reduction and the implicit Rao--Blackwellization of PPG. These are the first non-asymptotic results of this kind in this setting. We illustrate our theoretical results with numerical experiments supporting our claims.
In order for artificial neural networks to begin accurately mimicking biological ones, they must be able to adapt to new exigencies without forgetting what they have learned from previous training. Lifelong learning approaches to artificial neural networks strive towards this goal, yet have not progressed far enough to be realistically deployed for natural language processing tasks. The proverbial roadblock of catastrophic forgetting still keeps researchers from an adequate lifelong learning model. While efforts are being made to quell catastrophic forgetting, there is a lack of research into the importance of class ordering when training on new classes for incremental learning. This is surprising, as the ordering of "classes" that humans learn is heavily monitored and incredibly important. While heuristics to develop an ideal class order have been researched, this paper examines class ordering as it relates to priming as a scheme for incremental class learning. By examining the connections between various methods of priming found in humans and how those are mimicked yet remain unexplained in lifelong machine learning, this paper provides a better understanding of the similarities between our biological systems and the synthetic systems while simultaneously improving current practices to combat catastrophic forgetting. By merging psychological priming practices with class ordering, this paper identifies a generalizable method for class ordering in NLP incremental learning tasks that consistently outperforms random class ordering.
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero- and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks, but it is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence-based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner. We validate our approach using a new and still-challenging dataset for the task of NeRF inpainting.
Traditional approaches to reinforcement learning (RL) have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. While some approaches have been adopted to refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, learning compositional representations from hand-designed and context-independent self-supervised losses (multi-view) still adapts relatively slowly to the real world, which contains many non-IID subspaces requiring rapid distribution shift in both time and spatial attention patterns at varying levels of abstraction. In contrast, supervised language model cascades have shown the flexibility to adapt to many diverse manifolds, and hints of the self-learning needed for autonomous task transfer. However, to date, transfer methods for language models such as few-shot learning and fine-tuning still require human supervision, and transfer learning using self-learning methods has been underexplored. We propose a self-supervised loss policy called contrastive distillation, which manifests latent variables with high mutual information with both source and target tasks from weights to tokens. We show how this outperforms common methods of transfer learning and suggests a useful design axis of trading off compute for generalizability for online transfer. Contrastive distillation is improved through sampling from memory and suggests a simple algorithm for sampling negative examples for contrastive losses more efficiently than random sampling.